132 ◾ Bioinformatics
With the FASTA sequence, we can also download the sequence dictionary file. Once we
have downloaded the FASTA sequence of the reference genome, we can index the sequence
with both samtools and bwa. The following script first creates the directory “refgenome”,
and then, in that directory, it downloads the sequence of the current human genome and
its dictionary file from GATK resource bundle and indexes the FASTA:
mkdir refgenome
cd refgenome
wget https://storage.googleapis.com/genomics-public-data/
resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta
wget https://storage.googleapis.com/genomics-public-data/
resources/broad/hg38/v0/Homo_sapiens_assembly38.dict
#or create a fasta dict for reference
f=$(ls *.fasta)
samtools faidx ${f}
bwa index ${f}
cd ..
Only human sequences are available in the GATK resource bundle. However, if you down-
load the sequence of a reference genome from a different database, you may need to create
a dictionary for that sequence using Picard software as follows:
java -jar ~/software/picard.jar \
CreateSequenceDictionary \
R=Reference.fasta \
O=Reference.dict
4.2.2.2.3 Mapping reads to the reference genome
The second step is to align the raw reads to the reference genome that we have downloaded
and indexed in the previous step. We can use the aligner of our choice. The most com-
monly used one is “bwa mem”. The following script creates the directory “sam” and saves
the SAM files that contain the read mapping information into that directory. There will be
a SAM file for each individual.
mkdir sam
cd fastq
ref=$(ls ../refgenome/*.fasta)
for i in $(ls *.fastq | rev | cut -c 13- | rev);
do
bwa mem -M -t 4 \
-R “@RG\tID:${i}\tSM:${i}” \
${ref} \
${i}_chr21.fastq > \
../sam/${i}_chr21.sam 2> ../sam/${i}_chr21.log;
done
cd ..